ABSTRACT

The testing of bilingual students poses particular 
problems for analyses of performance, item bias, and test adequacy. A 
test administered in two languages to children selected for their 
language facility provides a special arena for the study of these 
problems. The Comprehensive Tests of Basic Skills (CTBS) was selected 
because the test content between the English and Spanish language 
versions is similar in rationale, administration, and interpretation; 
the differences that exist between the language versions are the 
result of literal translation problems. Evidence based on performance 
of English- and Spanish-speaking pupils suggests that the CTBS 
contains multiple sources of bias. The vocabulary subtest of the CTBS 
was administered to 1162 second-graders in bilingual education 
programs in the Southwest; 58 students received both test versions 
because the students were equally proficient in both languages.
Results show that patterns of performance for these students differ 
markedly between the two language versions, supporting the contention 
that the method of direct translation from English to Spanish for 
bilingual vocabulary testing may not be fully adequate for the needs 
of the bilingual program student, even when the Spanish version is a 
rather faithful translation of the English original. (Author/PN) 
ABSTRACT

The testing of bilingual students poses particular problems
for analyses of performance, item bias, and test adequacy.
When children are selected for their facility in two languages,
and the same test is administered in both languages, a special
arena is provided for the study of these problems. A widely
used test, the Comprehensive Tests of Basic Skills, is
available in both English and Spanish. The vocabulary subtest was
administered to 1162 second-graders in bilingual education
programs throughout the Southwest, as part of a larger study;
58 of those students received both versions of the test because
they were deemed equally proficient in both languages. Results
show that patterns of performance for these students differ
markedly between the two versions, and suggest that the test
differs in important dimensions even though the Spanish version
is a rather faithful translation of the English original.
INTRODUCTION

Severe problems confront the evaluation of bilingual program
students from the standpoint of both individual performance
measurement and the potential for bias in testing. Assessing
the student in the majority language runs one set of risks;
assessing in the native tongue runs another. The number of
studies which have successfully assessed a single skill in two
languages for the same individuals is exceedingly small
(Duran, 1980). Resolution of these problems is not aided by
the current controversy surrounding both the definition and
measurement of bilingualism itself (De Avila, 1978). Moreover,
thoroughly contradictory findings emerge from studies of
the acquisition of French by native English-speaking children
in Canada (Lambert & Tucker, 1972), of Swedish by native
Finnish-speaking children in Scandinavia (Skutnabb-Kangas &
Toukomaa, 1976), and of English by native Spanish-speaking
children in the U.S. (Fischer & Cabello, 1978). The integration
of such differences may rest in part on linguistic,
developmental, and/or sociocultural interpretations (Troike,
1978); a practical level of shared bilingualism or dominance of
one language over the other in the community may also play a
strong role (Laosa, 1975). Finnish-speaking children from the
populous southern districts find, and potentially model, both
Finnish and Swedish in almost every shop window, while the
politics of separatism are explicit in Quebec and de facto in
many areas of the American Southwest, so children from these
regions may encounter the second language with mixed emotions.
Even the assessment of a relatively simple arena like vocabulary
skills is complicated many times over when the students
must cope with two languages.

Measuring the skills of bilingual program students also
means assessing whether tests developed for the monolingual
English student are appropriate for making decisions about
the bilingual or limited-English-proficient students
characteristically found in such programs, and about the
minority groups who tend to be overrepresented there. Some
educators believe that many tests are intrinsically unfair to
minorities because the values they reflect are those of the
majority only (Cervantes, 1975). Others, however, hold that
tests of culturally defined content and vocabulary are not
biased because achievement itself is language and culture
specific (Ebel, 1975). But the impetus for testing continues:

    The problem now becomes not whether to test bilingual
    students, but rather how to do it in a manner that
    accurately assesses their specific abilities and in
    a manner that does not create a bias either against
    them or in favor of them (Cooper, 1978, p. 2, italics
    original).

We turn attention specifically to assessment in Spanish-English
bilingual programs at the primary level, and encounter
two factors which strongly militate against simple solutions
to the problems noted above. The first is that exceedingly
few instruments are available at present which are both
culturally appropriate and technically sound for this purpose.
"The problems are particularly acute with respect to
English-language measure, but are often equally pervasive in
instruments that are simply translations from English language
versions" (Burry, 1979, p. 8). The second is that
English-language instruction in reading, listening
comprehension, and vocabulary may be intrinsically more
difficult for Spanish-speaking children than for their native
English-speaking counterparts because of the increased rhythmic
and phonological complexity of English. Fundamental linguistic
skills for understanding Spanish are frequently inadequate for
comprehending English. Even a relatively simple phrase like
"I c'n take it home fer ya," (/&yknteyktth6wmf £ryt*/ for the
English listener) is likely to be heard by the native
Spanish-speaking child as /'aintekrcmfia*/, resulting in the
obliteration of six out of seven words in the sentence (Matluck
& Mace, 1972). The quantity of purely linguistic differences
between Spanish and English suggests that the Spanish-speaking
child is at no small disadvantage; especially in the primary
grades, appropriate language skills testing must not ignore
such difficulties.

The Comprehensive Tests of Basic Skills/Spanish (CTBS,
1974/1978) is in large measure a direct translation of its
English counterpart, which has been widely used as a primary
skills evaluation tool. The CTBS/S has been presented as a
major attempt to meet the needs of native Spanish-speaking
children (Finch, 1979). With such a test, the teacher can
select the language appropriate for a child with some
assurance that the instrument is valid, reliable, and unbiased
(Hoepfner & Christen, 1979). Thus, the CTBS and CTBS/S
should provide a good vehicle to examine individual
performance patterns in either language for students in
bilingual programs. However, recent evidence based on the
performance of English- and Spanish-speaking pupils suggests
that the tests contain multiple sources of bias (McArthur,
1981), so a particularly interesting situation for research
obtains when both versions of the CTBS are administered
to the same children. That is, if a group of children who
possess similar levels of knowledge in both English and
Spanish are tested on both instruments, will individual
performances be the same across the two? Will the results of
such dual language testing reflect patterns which can be
interpreted as the direct result of item bias? Will direct
translation hold up as a viable strategy for fair testing of
primary pupils in Spanish as well as English?

METHODS 

Subjects 

As part of a larger study (CSE, 1979), almost 1200 children
in bilingual education programs in 26 school districts
spread over five southwestern states were administered a
series of educational achievement tests by their teachers.
Programs were designed to provide instruction in reading and
mathematics at the upper primary level. Teacher reports
from these programs indicate that the time spent using Spanish
as the language of instruction was approximately equal
to the time spent using English. Ninety-three percent of
the program teachers had earned at least a BA or BS; 94%
were full-time employees of the school district, and 88%
had prior experience in bilingual education. Assignment of
students to these special programs relied primarily on teacher
evaluations and language dominance tests. Achievement tests
were infrequently used to determine remediation placement,
and intelligence test scores were generally excluded
altogether from placement considerations. Thus the programs
represented a major effort, competently staffed, to provide
special attention in a bilingual setting to student
educational needs. Most of the students were rated by their
teachers as having some skills in both English and Spanish.
Overall, only one child in ten from these classes was considered
monolingual Spanish, while only one in nine was rated as
monolingual English.

Instruments 

While a large number of instruments were used in the
investigation of programs, only the CTBS is of concern in the
present study. It was selected because test content between
the two language versions is virtually identical. The
CTBS-Spanish was the first test by a major publisher to be
subjected to a four-step editorial procedure designed to reduce
bias; included were studies of content validity, application
of editorial guidelines in item construction, reviews for bias,
and separate ethnic group pilot studies. The developers of
the Spanish-language version tried to keep the test content
and measurement features intact, thus building a test which
was similar in rationale, administration, and interpretation
to its parent version in English. What differences exist
are the result primarily of problems of literal translation.

The children in the study were given a large number of
standardized tests of achievement during the course of the
regular school year by their teachers. With regard to the
CTBS, the important instruction made to teachers was that
they decide in advance on an individual basis whether each
child would receive the English-language or Spanish-language
version of the test. This decision was left totally to the
discretion and best judgment of the classroom teachers. A
total of 1162 completed test forms were returned, 814 in
English and 348 in Spanish. Fifty-eight students in the sample
were found to have been tested in both languages; that is,
one student in every nineteen was given both forms of the
test because the teachers felt unable to distinguish in advance
which language these students should be tested in. No
evidence is available to suggest that any selection bias or
other external circumstance might have contributed to
obtaining this sample. Order of administration was apparently
random. For purposes of this report, only the Vocabulary
subscale of test level C, consisting of 33 items selected
in response to the teacher's verbal directions, is considered.

Methods of analysis 

Two techniques for analysis of response patterns were
utilized in this study. The first relies on the work of Sato
(1980) and colleagues in Japan; they have generated a systematic
method of appraisal of test performance based on the S-P
(Student-Problem) Chart, a matrix of right and wrong answers,
coded 1 or 0, for each respondent for each item. The N x n matrix
has the additional characteristics that students have been sorted
by descending total score and items have been sorted by
increasing difficulty. Thus the top row of the S-P Chart is a
representation of the pattern of correct and incorrect responses
to this sample of items by the most capable student in the group,
the bottom row by the least capable. The left-hand column shows
the pattern of responses to the easiest item in the set of items,
and the right-hand column shows the most difficult. From this
matrix are generated two statistics, one related to the response
pattern for the group as a whole, the other related to
individual performance vis-a-vis both the group and the
configuration of items, for each individual. The first is an
"index of discrepancy," D*, which ranges from 0.00 for a matrix
of perfect symmetry between student capabilities and item
difficulties, to 1.00 for a matrix representing exclusively
random responding.[1] The second is a "caution index," c_i, which
ranges from 0.00 for an individual whose response pattern is
perfectly fitted to that reflected in the order of item
difficulties as determined by the group, to 1.00 for an
individual whose pattern of responses is totally antithetical to
the order of item difficulties, and thus is quite unlike the
representative average respondent in the group.[2]

[1] $$ D^{*} = \frac{A(N, n, p)}{A_{B}(N, n, p)} $$
    where the numerator is a discrepancy between cumulative
    probability ogives obtained from the S-P chart, and the
    denominator is an analogous discrepancy as modeled by
    cumulative binomial distributions, both with the same
    number of cases (N), number of items (n), and average
    passing rate (p) (Sato, 1980).
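
The S-P chart ordering and the caution index described above can be illustrated with a minimal Python sketch. It is not the code used in the study. It assumes the covariance form of the caution index given in note 2 (Sato, 1980) and takes the "ideal" pattern for a student with total score t to be correct answers on the t easiest items. The function names (`sp_chart`, `covariance`, `caution_index`), the tie-breaking rule, the value returned for undefined cases, and the four-student data set are all assumptions made for illustration. The index of discrepancy D* is not computed here because its binomial-model denominator is not specified in enough detail in this paper.

```python
def sp_chart(responses):
    """Return (chart, student_order, item_order) for a 0/1 response matrix.

    Rows are students and columns are items. Students are sorted by
    descending total score and items by descending number correct, so
    that difficulty increases left to right. Ties keep their original
    order (an assumption; the paper does not state a tie rule).
    """
    n_items = len(responses[0])
    item_totals = [sum(row[j] for row in responses) for j in range(n_items)]
    item_order = sorted(range(n_items), key=lambda j: -item_totals[j])
    student_order = sorted(range(len(responses)),
                           key=lambda i: -sum(responses[i]))
    chart = [[responses[i][j] for j in item_order] for i in student_order]
    return chart, student_order, item_order


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def caution_index(row, item_totals):
    """Caution index c_i = 1 - cov(x_i, y) / cov(u_i, y) for one student."""
    score = sum(row)
    # Ideal pattern u_i: correct on exactly the `score` easiest items.
    easiest_first = sorted(range(len(row)), key=lambda j: -item_totals[j])
    ideal = [0] * len(row)
    for j in easiest_first[:score]:
        ideal[j] = 1
    denominator = covariance(ideal, item_totals)
    if denominator == 0:
        # All-correct or all-wrong rows (or equally hard items) leave the
        # index undefined; 0.0 is returned here as an assumption.
        return 0.0
    return 1 - covariance(row, item_totals) / denominator


# Hypothetical responses: 4 students (rows) by 4 items (columns).
raw = [
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
item_totals = [sum(column) for column in zip(*raw)]
chart, student_order, item_order = sp_chart(raw)
indices = [caution_index(row, item_totals) for row in raw]
```

In this invented example the third student answers a hard item correctly while missing an easier one, and is the only student whose caution index is above zero.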

The second analytic tool used in this study is a statistic
from Goodman and Kruskal called lambda, which has been applied
elsewhere to the detection of differences in response patterns
in testing (Veale & Forman, 1976). Here the focus is on
differences between groups in the attractiveness of incorrect
responses within the multiple-choice format of one correct and
three incorrect responses per item. Lambda is an index of the
pattern of choice for the incorrect responses. If the value of
lambda is 0.00, the two groups use about the same pattern of
selection of the incorrect responses. As the value increases,
one group is using a different strategy for selection of
incorrect responses than the other. The computation of lambda is
independent of the actual proportions within each group who
select the correct response to the item. In this paper, values
of lambda above .10 are considered noteworthy.[3]
Details of the computation and use of these approaches in the
context of testing and item bias detection research have been
set out elsewhere (McArthur, 1981). The usual test-retest and
reliability statistics are not appropriate here, because of the
attention to deciphering specific performance patterns rather
than whole-group performance.

[2] $$ c_i = 1 - \frac{\operatorname{cov}(x_{ij}, y_j)}{\operatorname{cov}(u_{ij}, y_j)} $$
    where the numerator is the covariance over problems of the
    i-th student's score on the j-th problem with the number of
    students who correctly answer that j-th problem, and the
    denominator is the covariance over problems of the i-th
    ideal student's score on the j-th problem with the number
    of students who correctly answer that j-th problem (Sato, 1980).

[3] $$ \lambda = \frac{\sum_{j} \max_{k} f_{jk} - \max_{k} f_{\cdot k}}{N - \max_{k} f_{\cdot k}} $$
    where max f_jk is the larger frequency of the two groups for
    any single wrong choice, max f_.k is the larger marginal
    frequency of the two groups across all wrong choices, and N
    is the total number of observations.
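
To make the lambda statistic concrete, here is a minimal Python sketch for a single item. It is an illustration under stated assumptions, not the computation used in the study: it reads the footnoted formula as the sum, over wrong choices, of the larger of the two groups' frequencies, minus the larger group total, divided by N minus that larger group total. The function name `wrong_answer_lambda` and the example counts are hypothetical.

```python
def wrong_answer_lambda(wrong_counts_a, wrong_counts_b):
    """Goodman-Kruskal lambda for a (wrong choice) x (group) table.

    Each argument lists, for one group, how many respondents picked each
    incorrect choice of a single item, in the same choice order. Correct
    responses are left out, so the result does not depend on how many
    respondents in each group answered correctly.
    """
    if len(wrong_counts_a) != len(wrong_counts_b):
        raise ValueError("both groups need counts for the same choices")
    total_a = sum(wrong_counts_a)
    total_b = sum(wrong_counts_b)
    n = total_a + total_b
    largest_marginal = max(total_a, total_b)
    if n == largest_marginal:
        # One group has no wrong answers: lambda is undefined, and 0.0 is
        # returned here as an assumption.
        return 0.0
    sum_of_row_maxima = sum(max(a, b)
                            for a, b in zip(wrong_counts_a, wrong_counts_b))
    return (sum_of_row_maxima - largest_marginal) / (n - largest_marginal)


# Hypothetical counts for the three incorrect choices of one item.
same_pattern = wrong_answer_lambda([10, 5, 5], [10, 5, 5])
shifted_pattern = wrong_answer_lambda([12, 4, 4], [4, 12, 4])
opposite_pattern = wrong_answer_lambda([20, 0, 0], [0, 10, 10])
```

With these invented counts, identical patterns give 0.0, a shift of the favored wrong answer gives 0.4, and completely separate wrong answers give 1.0. Only the second and third would exceed the .10 threshold used in this paper.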

Hypotheses 

Because of the process of respondent selection, the specific
hypotheses about their performance on the English-language and
Spanish-language versions of the Vocabulary subtest were, first,
that the achieved scores between tests would be perfectly
correlated. Additionally, the S-P charts for the two versions
would be similar, as shown by equal indices of discrepancy, D*.
At the level of the individual respondent, it was hypothesized
that the achieved total score in English would equal the achieved
total in Spanish, and that the caution index generated for each
individual in the English-language S-P chart would be equal to
the caution index obtained by the same individual from the
Spanish-language S-P chart.

RESULTS 

Total scores on the English-language Vocabulary subtest
averaged 75.^4% correct with a range of 6 to 33 items correct.
On the Spanish-language version, the average was 37.56% correct
with a range of 4 to 25. The total scores are significantly
(p < .05) correlated, r = .48. Median improvement from Spanish
to English is 13 answers correct. Only three of the 58
participants did not show improvement in their total scores from
Spanish to English.

Two of the 33 items yielded higher percentages of correct
responses in the Spanish-language version than in the English.
For the remainder of the items, students were able to select the
correct response less frequently in the Spanish-language version,
often by substantial margins. The ratio of Spanish correct to
English correct for each item is shown in the first column of
Table 1. The consistency with which students picked the correct
answer in both languages ranged from moderately high (65% of the
respondents chose the correct answer to item 8 in both languages)
to very low (only 7% chose the correct response to item 31 in
both languages). The consistency of selection of incorrect
responses was generally extremely low, reaching 14% for items
24 and 31. The proportions of joint correct and joint incorrect
responses are shown in columns 2 and 3 of Table 1.

Insert Table 1 about here
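
For readers who wish to reproduce the first two per-item summaries described above, the following is a minimal Python sketch. It assumes that each respondent's answer to one item is coded 1 (correct) or 0 (incorrect) in each language, with the two lists in the same respondent order. The function name `item_consistency` and the five-respondent data are hypothetical and are not taken from the study.

```python
def item_consistency(english_correct, spanish_correct):
    """Return (Spanish:English correct ratio, joint-correct proportion).

    Both arguments are 0/1 lists for one item, aligned by respondent.
    The ratio is None when nobody answered correctly in English.
    """
    if len(english_correct) != len(spanish_correct):
        raise ValueError("lists must cover the same respondents")
    n_respondents = len(english_correct)
    n_english = sum(english_correct)
    n_spanish = sum(spanish_correct)
    n_both = sum(1 for e, s in zip(english_correct, spanish_correct)
                 if e == 1 and s == 1)
    ratio = n_spanish / n_english if n_english else None
    return ratio, n_both / n_respondents


# Hypothetical item answered by five dual-tested respondents.
ratio, joint_correct = item_consistency([1, 1, 1, 1, 0], [1, 0, 1, 0, 0])
```

A ratio below 1.0 indicates an item answered correctly less often in Spanish than in English, which is the pattern reported for all but two of the 33 items.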

Those incorrect answers to items which garnered at least
10% more responses than the next most frequently chosen incorrect
response were termed "popular distractors". Three items with
popular distractors were found in the English-language version,
while twelve were found in the Spanish. The average percentage of
respondents who chose the correct answer to an item in English
but were swayed to choose the popular distractor (incorrect)
response to that same item in Spanish was 35%. The reverse
pattern, choosing a popular distractor response in English
although selecting the correct response to that same item in
Spanish, averaged 30%. Whether a specific item contained a
popular distractor, and if so the percentage of respondents
correct on the same item in the other language but who chose
that popular distractor, is indicated in the next four columns
of Table 1.
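
The popular-distractor rule can be sketched in Python as follows. The paper does not say whether "10% more responses" means percentage points of all respondents or a relative increase over the runner-up; this sketch assumes the former, and that choice is an assumption rather than the study's stated rule. The function name `popular_distractor` and the response counts are hypothetical.

```python
def popular_distractor(choice_counts, correct_choice, margin=0.10):
    """Return the popular distractor for one item, or None.

    `choice_counts` maps each answer choice to the number of respondents
    who selected it. A wrong choice qualifies when it leads the next most
    frequent wrong choice by at least `margin`, measured as a share of
    all respondents to the item (assumed reading of the 10% rule).
    """
    total = sum(choice_counts.values())
    wrong = sorted(
        ((count, choice) for choice, count in choice_counts.items()
         if choice != correct_choice),
        reverse=True,
    )
    if total == 0 or len(wrong) < 2:
        return None
    (top_count, top_choice), (second_count, _) = wrong[0], wrong[1]
    if (top_count - second_count) / total >= margin:
        return top_choice
    return None


# Hypothetical counts for one item answered by 58 respondents; "A" is correct.
spanish_item = {"A": 20, "B": 25, "C": 8, "D": 5}
english_item = {"A": 44, "B": 6, "C": 5, "D": 3}
spanish_result = popular_distractor(spanish_item, "A")
english_result = popular_distractor(english_item, "A")
```

In this invented case choice "B" is a popular distractor in the Spanish version only, since it leads the next wrong choice by 17 of 58 respondents.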

The data to this point quite clearly indicate that the
Spanish-language version of the CTBS presented a far more
difficult task for these respondents than did the
English-language version. Only infrequently did a vocabulary
item have an equal percentage of incorrect selections in both
versions. Examination of the S-P charts is necessary to show
whether the difference in performance patterns is systematic.

The Spanish-language version generated a D* of .53, a
relatively high level of randomness of responses, while the
English-language version yielded a D* of .24, reflecting a much
more orderly fit of subject capabilities to item difficulties. No
exact test of significance exists for the size of, or differences
between, D* values, but in this instance they represent
configurations of the S-P charts which are distinctly different
visually. The difference is supported by reference to the
caution indices, which for individual respondents to the
English-language version averaged .17, but to the
Spanish-language version .25. That is, on average the respondents
were more consistent in selecting correct answers to easy items
and incorrect answers to difficult items in the English-language
version. In fact, the number of respondents with caution indices
of 0.00 is much higher in English. Of particular interest is that
the correlation between the two indices computed across the 58
participants is nonsignificant. Changes in caution indices from
one language version to the other are uncorrelated.

The computation of lambda, which details differences in
selection patterns for wrong answers, showed that twelve out of
33 items had large discrepancies in the obtained configuration.
That is, for a large number of items, the respondents shifted
their choice from one incorrect answer to another across
language versions, rather than picking the same incorrect
responses on both occasions. The last column of Table 1
indicates those items with such shifts in incorrect answers.

DISCUSSION 

The findings of this study in general comport with earlier
research on the CTBS in English and Spanish using independent
groups of bilingual program respondents (McArthur, 1981). The
distributions of total subscale scores, the higher D* indices
for the Spanish-language version, and the number of popular
distractors and of lambda values exceeding .10 are all similar.
That the two versions of the test do not produce equal outcomes
even when the actual respondents are identical seems clear from
the present data. Had there been equivalence of total
subscale scores, of group or individual patterns of correct
scores, or of selection of wrong answers between the English-
and Spanish-language versions, the number of discrepancies
emerging from the statistical computations would have been far
smaller. In their present configuration, these data suggest that
children do not show the same performance patterns in response
to the two versions of the test. Review of data contained in
Table 1 suggests that many of the items may be suspected of
somehow biasing the choice of correct response, and that such
potentially biasing items are more prevalent in the
Spanish-language version.

The relatively small number of individuals represented in
this study makes these results necessarily tentative: they are
presented neither as a representation of majority vs. minority
responses to a specific test, nor as an indication in any way
of a measure of "true ability" among bilingual program students.
Rather, the unusual trial of a purportedly decent test in two
languages, a purportedly equal-ability student sample, and a
classroom experience for that sample equally divided into the
use of English and Spanish, demands thoughtful attention to the
appraisal of testing. In the present investigation, one
weakness is the absence of an independent and unambiguous
assessment of bilingual capability, and the ensuing reliance on
the accuracy of teacher selection of students equally competent
in two languages. DeAvila and Duncan (1978) have pointed out
numerous shortcomings in teacher ratings of language competence.
However, for this study, students were not drawn for their
equally high abilities or for the purposes of assembling a
homogeneous sample, but only because their language abilities
were judged to be equally high or equally low in both languages.
Nothing is known about the relative levels of exposure to
English or Spanish outside the school, nor about the relative
strengths and weaknesses of the texts in both languages used in
the program. However, the teachers' close personal supervision
of students and the even division between English and Spanish as
the language of instruction in these programs suggest that the
children's levels of readiness for vocabulary would be roughly
similar. Another weakness is the relatively small number of
items included in this investigation. However, the CTBS appears
to represent the state of the art in English/Spanish testing of
vocabulary skill at this level, and no other instrument is known
to be a closer approximation to neutrality. The present results
support the contention that the method of direct translation
from English to Spanish for bilingual vocabulary testing may not
be fully adequate for the needs of the bilingual program student.
Table 1

Summary of Findings for the CTBS and CTBS/S

[Table body not preserved. Surviving column-heading fragments:
"correct to English", "wrong".]